Scale invariant value computation for reinforcement learning
Abstract
Natural learners must compute an estimate of future outcomes that follow from a stimulus in continuous time. Critically, the learner cannot in general know a priori the relevant time scale over which meaningful relationships will be observed. Widely used reinforcement learning algorithms discretize continuous time and use the Bellman equation to estimate exponentially-discounted future reward. However, exponential discounting introduces a time scale into the computation of value, implying that the relative values of various states depend on how time is discretized. This is a serious problem in continuous time, as successful learning requires prior knowledge of the solution. We discuss a recent computational hypothesis, developed from work in psychology and neuroscience, for computing a scale-invariant timeline of future events. This hypothesis efficiently computes a model for future time on a logarithmically-compressed scale. Here we show that this model for future prediction can be used to generate a scale-invariant, power-law-discounted estimate of expected future reward. The scale-invariant timeline could provide the centerpiece of a neurocognitive framework for reinforcement learning in continuous time.

Introduction

In reinforcement learning, an agent learns how to optimize its actions by interacting with the environment, aiming to maximize temporally-discounted future reward. In order to navigate the environment, the agent perceives stimuli that define different states. The stimuli are experienced embedded in continuous time, with temporal relationships that the agent must learn in order to acquire the optimal action policy. Temporal discounting is well justified by numerous behavioral experiments on humans and animals (see e.g. Kurth-Nelson, Bickel, and Redish (2012)) and is useful in numerous practical applications (see e.g. Mnih et al. (2015)). If the value of a state is defined as expected future reward discounted with an exponential function of future time, value can be updated in a recursive fashion, following the Bellman equation (Bellman, 1957). The Bellman equation is a foundation of highly successful and widely used modern reinforcement learning approaches such as dynamic programming and temporal difference (TD) learning (Sutton and Barto, 1998).

Exponential temporal discounting is not scale-invariant

When using the Bellman equation (or exponential discounting in general), the values assigned to the states depend on the chosen discretization of the temporal axis in a non-linear fashion. Consequently, the ratio of the values attributed to the states changes as a function of the chosen temporal resolution and of the base of the exponential function. To illustrate this, let us define the value of a state s observed at time t as a sum of expected rewards r discounted with an exponential function:
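The displayed equation that would follow this sentence is missing from this extract. As a hedged reconstruction of the standard definition the sentence describes (with γ, 0 < γ < 1, the per-step discount factor and Δt the width of a discrete time step; this notation is assumed here rather than taken from the paper), the value would read:

$$V(s_t) \;=\; \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t + k\Delta t}\right].$$

Because the exponent counts discrete steps, the effective discount applied to a reward at a fixed real delay τ is γ^{τ/Δt}, which is where the dependence on the chosen discretization enters.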
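To make the scale-dependence concrete, the following is a minimal numeric sketch, not code from the paper: the reward times, the value of γ, and the power-law exponent are illustrative assumptions. It compares the ratio of two state values under exponential discounting at two temporal resolutions, and under a simple power-law discount, which has no per-step base and is unchanged by rescaling the time unit.

```python
# Illustrative sketch (assumed example, not from the paper): how the ratio of two
# state values behaves under exponential vs. power-law discounting when the time
# axis is discretized at different resolutions.
import numpy as np

def exp_discounted_value(reward_times, reward_mags, gamma, dt):
    """Exponentially discounted value: sum_i gamma**(t_i / dt) * r_i."""
    steps = np.asarray(reward_times) / dt             # delays measured in time steps
    return float(np.sum(np.asarray(reward_mags) * gamma ** steps))

def power_discounted_value(reward_times, reward_mags, alpha=1.0):
    """Power-law discounted value: sum_i r_i * t_i**(-alpha)."""
    t = np.asarray(reward_times, dtype=float)
    return float(np.sum(np.asarray(reward_mags) * t ** (-alpha)))

# State A: a unit reward 2 s in the future.  State B: a unit reward 8 s in the future.
A = ([2.0], [1.0])
B = ([8.0], [1.0])

for dt in (1.0, 0.1):                                  # two choices of temporal resolution
    ratio = (exp_discounted_value(*A, gamma=0.9, dt=dt)
             / exp_discounted_value(*B, gamma=0.9, dt=dt))
    print(f"exponential discounting, dt={dt}: V(A)/V(B) = {ratio:.2f}")

# The exponential ratio grows from ~1.9 (dt=1) to ~560 (dt=0.1): relative values
# depend on how time is discretized.  The power-law ratio involves no time step
# and is unchanged when all delays are rescaled by a common factor c:
for c in (1.0, 1000.0):                                # e.g. seconds vs. milliseconds
    Ac, Bc = ([2.0 * c], [1.0]), ([8.0 * c], [1.0])
    ratio = power_discounted_value(*Ac) / power_discounted_value(*Bc)
    print(f"power-law discounting, time unit scaled by {c}: V(A)/V(B) = {ratio:.2f}")
```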